Abstract. In this paper, we consider the problem of efficiently representing
a set $S$ of $n$ items out of a universe $U = \{0, \ldots, u-1\}$ while supporting
a number of operations on it. Let $G = g_1 \ldots g_n$ be the gap stream associated
with $S$, $gap$ its bit-size when encoded with gap-encoding, and $H_0(G)$
its empirical zero-order entropy. We prove that (1) $nH_0(G) \in o(gap)$ if $G$ is
highly compressible, and (2) $nH_0(G) \le n\log(u/n) + n \le uH_0(S)$. Let $d$ be
the number of distinct gap lengths between elements in $S$. We first propose
a new space-efficient zero-order compressed representation of $S$ taking
$n(H_0(G) + 1) + O(d\log u)$ bits of space. Then, we describe a fully-indexable
dictionary that supports rank and select queries in $O(\log(u/n) + \log\log u)$
time while requiring asymptotically the same space as the proposed
compressed representation of $S$.

Keywords: dictionary problem, gap encoding, entropy, compression, rank, 
select 


1 Introduction and Related Work 

The dictionary problem on set data asks to maintain a (space-efficient) data
structure, called an indexable dictionary, over a set
$S = \{s_1, \ldots, s_n\} \subseteq \{0, \ldots, u-1\} = U$, $s_1 < s_2 < \ldots < s_n$,
efficiently supporting a range of queries on $S$. In this problem, $U$ is an
ordered set and is called the universe. As shown by Jacobson in his doctoral
thesis [10], a set of just two operations, rank and select, is powerful enough
to derive the other fundamental functionalities desired from such a structure:
member, successor, and predecessor. $rank(S, x)$, with $x \in U$, is the number
of elements in $S$ that are smaller than or equal to $x$. $select(S, i)$, where
$0 \le i < n$, is the $i$-th smallest element in $S$ (counting from 0, i.e.
$s_{i+1}$). In this paper, we focus on fully-indexable dictionaries (FIDs), i.e.
data structures supporting both rank and select operations efficiently.
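
As an illustration of these definitions, the following Python sketch implements rank and select on a plain sorted list and derives member, successor, and predecessor from them. It is only a reference model of the query semantics, not a space-efficient structure; the function names, the example set, and the non-strict conventions chosen for successor and predecessor are assumptions made for this example.

```python
from bisect import bisect_right

def rank(S, x):
    """Number of elements of the sorted list S that are <= x."""
    return bisect_right(S, x)

def select(S, i):
    """i-th smallest element of S, counting from 0 (0 <= i < n)."""
    return S[i]

def member(S, x):
    # x belongs to S iff moving from x - 1 to x raises the rank by one
    return rank(S, x) - rank(S, x - 1) == 1

def successor(S, x):
    """Smallest element >= x, or None if there is none (assumed convention)."""
    r = rank(S, x - 1)
    return select(S, r) if r < len(S) else None

def predecessor(S, x):
    """Largest element <= x, or None if there is none (assumed convention)."""
    r = rank(S, x)
    return select(S, r - 1) if r > 0 else None

S = [2, 3, 7, 11]          # a small example set inside U = {0, ..., 15}
print(rank(S, 7), select(S, 2), member(S, 8), successor(S, 4), predecessor(S, 6))
# prints: 3 7 False 7 3
```

Every derived query costs a constant number of rank and select calls, which is why a FID only needs to support these two operations efficiently.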

Jacobson [10] proposed a solution for this problem taking $u + o(u)$
bits of space and supporting constant-time rank. Constant-time select within
$o(u)$ bits of additional space was added by Munro [13] and Clark [5]. These
results were further improved first by Pagh [14] (who considered rank) and
then by Raman et al. [16] (rank and select) with structures having the same
time complexities and requiring only $B(n,u) + O(u\log\log u/\log u)$ bits of
space, where $B(n,u) = \lceil \log \binom{u}{n} \rceil$ is the minimum number of bits
required to distinguish any two size-$n$ subsets of $U$. Finally, Pătraşcu [15]
reduced the sublinear term to $O(u/\mathrm{polylog}(u))$ while retaining constant query
times. Although these last results are optimal for large values of $n$, the $o(u)$
term can be much bigger than $B(n,u)$ (even exponentially) if $n$ is
very small. Moreover, even the $B(n,u)$ term is not optimal for all instances,
and can be improved in many cases of practical interest. To see why, it is
sufficient to notice that zero-order entropy compressors encode all size-$n$
subsets $S$ of $U$ to the same bit-size, without taking advantage of the
structure of $S$ (for example, long or regular distances between its elements).
This problem motivates the search for more data-aware measures able to break
the $B(n,u)$ limit in some cases. One of the most widely known such data-aware
measures is $gap$ [3], which is defined as the sum of the bit-lengths of the
distances between consecutive elements in $S$. If these distances are not evenly
distributed, $gap$ can be much smaller than $B(n,u)$, reaching 10%-40% of
$B(n,u)$ in some instances of practical interest [8]. By using logarithmic codes
such as Elias $\delta$-encoding [6], the stream of gaps can be compressed to
$gap + o(gap)$ bits, where the $o(gap)$ overhead comes from the prefix property
of such codes, needed to unambiguously reconstruct codeword boundaries. In [9],
Gupta et al. show how to build a FID based on $\delta$-encoding requiring only
$gap + O(n\log(u/n)/\log n) + O(n\log\log(u/n))$ bits of space and supporting
rank and select in $AT(u,n) \in o((\log\log u)^2)$ time (this is nearly optimal
within that space, see [?]) and $O(\log\log n)$ time, respectively. Other recent
works [11,17] showed that constant-time queries can be supported using
$gap + O(n\log\log(u/n)) + o(u)$ bits of space, where the $o(u)$ term is
$O(u\log\log u/\sqrt{\log u})$ in [11] and $O(u\log\log u/\log u)$ in [17].

$gap$ reaches its maximum when all gap lengths are equal. However, it is
clear that in this scenario other techniques (e.g. zero-order entropy
compression) could be combined with gap encoding in order to turn this worst
case into an $O(n)$-bit best case. In this paper we explore the possibility of
compressing the stream of gaps $G$ to its zero-order empirical entropy $H_0(G)$,
aiming at obtaining $nH_0(G)$ as the leading term in the space complexity of our
structures. Similar techniques are already employed in BWT-based text
compression algorithms [1], where runs of zeros in the move-to-front encoding of
the BWT are compressed using run-length encoding followed either by zero-order
entropy compression or by logarithmic encoding [6] (runs being mostly dominated
by small numbers). We first observe that $nH_0(G) \in o(gap)$ if gaps are
highly compressible, and prove that $nH_0(G)$ does not exceed $n\log(u/n) + n$
bits. Moreover, $nH_0(G)$ is provably no larger than the zero-order empirical
entropy $uH_0(S)$ of the set $S$ and than the size of any decodable gap-encoded
representation of $S$.

These considerations suggest that the data-aware measure $nH_0(G)$ should
be preferred to $gap$ in cases where the overhead introduced by the zero-order
compressor (e.g. a codebook) is negligible. Our work goes in this direction.
First of all, we show a new zero-order compressed representation of bitvectors
taking $nH_0(G) + n + O(d\log u) \le uH_0 + n + O(\sqrt{u}\log u)$ bits of space,
where $u$ is the length of the bitvector, $H_0$ its zero-order empirical entropy,
$n$ the number of bits set, and $d$ the number of distinct distances between bits
set. $d$ is trivially upper-bounded by $n$ and $O(\sqrt{u})$, and is negligible in
many practical cases (e.g. when $S$ is dense or the gaps are evenly distributed).
We finally propose a fully-indexable dictionary that answers rank and
select queries in $O(\log(u/n) + \log\log u)$ time and whose space occupancy
is $(1 + o(1))nH_0(G) + (3 + o(1))n + O((d + \log\log u)\log u)$ bits. In all
cases where $H_0(G) \in \omega(1)$ and $d > \log\log u$, this is asymptotically the
same space as our new bitvector representation. Moreover, if $S$ is dense
enough, i.e. $n \in \Omega(u/\mathrm{polylog}(u))$, all queries are supported in
$O(\log\log n)$ time, which is optimal within this space.

2 Gap-Encoded Dictionaries 

In this section we will assume that $u-1 \in S$, so that each gap corresponds
to an element in $S$ (i.e. the element following the gap). If $u-1 \notin S$, then
we can simply use an extra bit to denote this case and encode the final gap
length separately. We will moreover assume that $n < u/2$. Logarithms are
taken in base 2, unless otherwise specified. In gap encoding, we represent
the set $S = \{s_1, \ldots, s_n\} \subseteq \{0, \ldots, u-1\} = U$,
$s_1 < s_2 < \ldots < s_n$, as the stream of gaps $g_1, \ldots, g_n$, where
$g_1 = s_1 + 1$ and $g_i = s_i - s_{i-1}$ for $i > 1$. In order to reduce the space
occupancy of the stream, variable-length encoding can be used to encode each of
the $g_i$. The data-aware measure $gap(S)$ is defined as

$$gap(S) = \sum_{i=1}^{n} \left( \lfloor \log g_i \rfloor + 1 \right),$$

that is, the total number of bits required to store all the $g_i$ using the
minimum number of bits to represent each gap. When clear from the context, we
will simply write $gap$ instead of $gap(S)$. Clearly, $S$ cannot be represented
using only $gap$ bits, since we need additional information in order to make the
stream uniquely decodable. We adopt a notation similar to [9] and indicate with
$Z_C(S)$ (or simply $Z_C$ when clear from the context) the decoding overhead
(in bits) introduced by the coding scheme $C$. If we use a separate bitvector $B$
marking with a 1 the beginning of each code, then we obtain $Z_B = gap$. Another
solution is to use logarithmic codes such as Elias $\gamma$- or $\delta$-encoding
[6]. In $\gamma$-encoding, we encode $\lfloor \log g_i \rfloor + 1$ in unary,
followed by the $\lfloor \log g_i \rfloor$-bit binary representation of $g_i$
without the most significant 1. Then, $Z_\gamma = gap - n$. A better solution is
$\delta$-encoding, where we encode with $\gamma$ the number
$\lfloor \log g_i \rfloor + 1$, followed by the $\lfloor \log g_i \rfloor$-bit
binary representation of $g_i$ without the most significant 1. Then,
$Z_\delta = 2\sum_{i=1}^{n} \lfloor \log(\lfloor \log g_i \rfloor + 1) \rfloor$ bits.
Since $\log$ is a concave function, the worst case of $gap$ occurs when
$g_1 = g_2 = \ldots = g_n = u/n$ (by Jensen's inequality), yielding the upper
bounds $gap \le n\log(u/n) + n$ and $Z_\delta \le 2n\log(\log(u/n) + 1)$.
Then, one can prove the following (for the original proof, see [8]):

Lemma 1. $gap \le B(n,u)$ if $n < u/2$.

Proof. The claim follows directly from $gap \le n\log(u/n) + n$ and from the
fact that $B(n,u) = n\log(u/n) + n\log e - O(n/u) + O(\log n)$ if $n < u/2$. □

Moreover, let $H_0(S) = \frac{n}{u}\log\frac{u}{n} + \frac{u-n}{u}\log\frac{u}{u-n}$
be the zero-order empirical entropy of the set $S$. Since $B(n,u) \le uH_0(S)$,
we have that:

Corollary 1. $gap \le uH_0(S)$ if $n < u/2$.
The above inequalities are important as they show that gap encoding can
never perform worse than zero-order entropy compression. On the other hand,
experiments show [9] that $gap$ can be significantly smaller than $B(n,u)$ in
many cases of interest, thus motivating its use in practical applications. In the
following section we take one step forward, exploring what happens when we
treat $S$ as a sequence on the alphabet $\{g_1, \ldots, g_n\}$ and then apply
zero-order entropy compression to it.
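
The quantities introduced in this section can be checked on a toy instance. The sketch below computes the gap stream, the measure $gap$, and the overheads of Elias $\gamma$- and $\delta$-encoding by building the codewords explicitly as bit strings. The helper names and the example set are hypothetical, and the unary convention (zeros closed by a 1) is one common choice assumed here.

```python
from math import log2

def gap_stream(S):
    """g_1 = s_1 + 1 and g_i = s_i - s_{i-1} for i > 1 (S sorted, 0-based values)."""
    return [S[0] + 1] + [b - a for a, b in zip(S, S[1:])]

def gap_measure(G):
    """Sum of floor(log2 g) + 1 over all gaps, i.e. of their binary lengths."""
    return sum(g.bit_length() for g in G)

def elias_gamma(x):
    """floor(log2 x) + 1 in unary (zeros closed by a 1), then x without its leading 1."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + "1" + b[1:]

def elias_delta(x):
    """gamma code of floor(log2 x) + 1, then x without its leading 1."""
    b = bin(x)[2:]
    return elias_gamma(len(b)) + b[1:]

u, S = 16, [1, 3, 4, 9, 15]        # hypothetical example with u - 1 in S
G = gap_stream(S)                  # [2, 2, 1, 5, 6]; the gaps sum to u
n, gap = len(G), gap_measure(G)
Z_gamma = sum(len(elias_gamma(g)) for g in G) - gap
Z_delta = sum(len(elias_delta(g)) for g in G) - gap
print(G, gap, Z_gamma, Z_delta)    # [2, 2, 1, 5, 6] 11 6 8
print(gap <= n * log2(u / n) + n)  # True
```

On this instance $Z_\gamma = gap - n = 6$ and $Z_\delta = 8$, matching the closed forms given above.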

2.1 A Compressed-Gap Data-Aware Measure 

$gap$ reaches its worst case of $n\log(u/n) + n$ bits when all gaps have the same
length. However, it is clear that entropy compression should turn this
worst-case scenario into a best case, since the zero-order empirical entropy of
such a configuration is equal to 0. More formally, let us consider the following
representation $G$ of $S$. We define $G$ to be the sequence
$g_1 g_2 \ldots g_n \in \Sigma_{gap}^n$, where
$\Sigma_{gap} = \{g_1, g_2, \ldots, g_n\}$. Let moreover $d = |\Sigma_{gap}|$ be the
alphabet size and $f(s) = occ(s)/n$, $s \in \Sigma_{gap}$, be the empirical relative
frequency of $s$ in $G$, where $occ(s)$ is the number of occurrences of $s$ in $G$.
We define the zero-order empirical entropy of the gaps $H_0(G)$ to be


Definition 1. $H_0(G) = -\sum_{s \in \Sigma_{gap}} f(s)\log f(s)$


$nH_0(G)$ is the minimum number of bits output by any compressor that
encodes $G$ assigning a unique code to each symbol in $\Sigma_{gap}$. First of all,
we observe that $nH_0(G)$ can be significantly smaller than $gap$: if
$g_1 = g_2 = \ldots = g_n = u/n$, then $n\log(u/n) \le gap \le n\log(u/n) + n$
and $nH_0(G) = 0$. Moreover, $nH_0(G)$ is never worse than the length of any
decodable gap-compressed sequence:

Lemma 2. $nH_0(G) \le gap + Z_C$, where $C$ is any prefix coding scheme.

Proof. Follows directly from the fact that no prefix code can compress $G$ in
less than $nH_0(G)$ bits. □
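
A small numerical check of Definition 1 and Lemma 2, given as a hedged sketch: the function names and the two gap streams are hypothetical, and $\delta$-encoding is used as the prefix scheme $C$, with its code length computed from the closed form stated in the previous section.

```python
from collections import Counter
from math import log2

def h0(seq):
    """Zero-order empirical entropy: sum over symbols of f(s) * log2(1 / f(s))."""
    n = len(seq)
    return sum(c / n * log2(n / c) for c in Counter(seq).values())

def delta_len(x):
    """Length of the Elias delta code of x: b + 2 * floor(log2 b), b = bit length of x."""
    b = x.bit_length()
    return b + 2 * (b.bit_length() - 1)

equal = [8] * 8                    # u = 64, n = 8: all gaps equal to u / n
mixed = [2, 2, 1, 5, 6]            # hypothetical gap stream with u = 16

for G in (equal, mixed):
    n = len(G)
    gap = sum(g.bit_length() for g in G)
    prefix_bits = sum(delta_len(g) for g in G)     # gap + Z_delta
    print(n * h0(G), gap, prefix_bits)
```

For the stream of equal gaps, $nH_0(G) = 0$ while $gap = 32 = n\log(u/n) + n$, which is the worst case of $gap$ turned into the best case of $nH_0(G)$.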

Using Lemma 2 and the bounds for $gap$ and $Z_\delta$ derived in the previous
section, one can obtain $H_0(G) \le \log(u/n) + 2\log(\log(u/n) + 1) + 1$. With
the following theorem we show a much stronger upper bound:

Theorem 1. $H_0(G) \le \log(u/n) + 1$

Proof. We want to compute

$$\max_{\Sigma_{gap} \subseteq \mathbb{N}_{>0}} \; \max_{f : \Sigma_{gap} \to \mathbb{R}^+} H_0(G)$$

where the alphabet $\Sigma_{gap}$ and the empirical frequency function $f$ must
satisfy:

$$n \sum_{s \in \Sigma_{gap}} f(s) \cdot s = u \qquad (1)$$

Let $d = |\Sigma_{gap}|$. From Definition 1 and from the concavity of $\log$, we
have that $H_0(G)$ reaches its maximum $H_0(G) = \log d$ when all frequencies are
equal, i.e. $f(s) = d^{-1}$ for all $s \in \Sigma_{gap}$. We thus have

$$\max_{\Sigma_{gap} \subseteq \mathbb{N}_{>0}} \; \max_{f : \Sigma_{gap} \to \mathbb{R}^+} H_0(G) = \max_{\Sigma_{gap} \subseteq \mathbb{N}_{>0},\; f(s) = d^{-1},\; s \in \Sigma_{gap}} \log d$$

In order to maximize $\log d$, we now have to find $\Sigma_{gap}$ of maximum
cardinality that satisfies condition (1). It is easy to see that
$\Sigma_{gap} = \{1, \ldots, d\}$ minimizes
$\sum_{s \in \Sigma_{gap}} s = \sum_{i=1}^{d} i = d(d+1)/2$. Since, moreover,
$f(s) = d^{-1}$ for all $s \in \Sigma_{gap}$, we can rewrite (1) as
$nd^{-1}(d(d+1)/2 + k) = u$, where $k \ge 0$. Solving for $d$, we obtain the set
of solutions

$$Z = \left\{ \left( b \pm \sqrt{b^2 - 8kn^2} \right) / (2n) \;\middle|\; b = 2u - n \wedge k \ge 0 \right\}$$

for which we have $\max Z = (2u - n)/n$ when $k = 0$. This implies that
$\Sigma_{gap} = \{1, \ldots, (2u - n)/n\}$ and $f(s) = n/(2u - n)$ for all
$s \in \Sigma_{gap}$ maximize $H_0(G)$. Our claim follows:

$$H_0(G) \le \log d \le \log(2u/n) = \log(u/n) + 1$$

□
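
Theorem 1 can also be checked exhaustively on small universes. The following sketch (function names are hypothetical, and the range of universe sizes is an arbitrary choice) enumerates every set $S \subseteq \{0, \ldots, u-1\}$ containing $u-1$, builds its gap stream, and records the largest value of $H_0(G) - (\log(u/n) + 1)$; the theorem predicts that this value is never positive.

```python
from collections import Counter
from itertools import combinations
from math import log2

def h0(seq):
    """Zero-order empirical entropy of a sequence."""
    n = len(seq)
    return sum(c / n * log2(n / c) for c in Counter(seq).values())

def worst_slack(u):
    """Max of H0(G) - (log2(u/n) + 1) over all sets S in {0..u-1} containing u - 1."""
    worst = float("-inf")
    for n in range(1, u + 1):
        for rest in combinations(range(u - 1), n - 1):
            S = list(rest) + [u - 1]
            G = [S[0] + 1] + [b - a for a, b in zip(S, S[1:])]
            worst = max(worst, h0(G) - (log2(u / n) + 1))
    return worst

# Theorem 1 predicts a non-positive slack for every universe size
print(all(worst_slack(u) <= 0 for u in range(1, 13)))

# Extremal shape used in the proof: gaps 1..d, each occurring once (n = d)
d = 4
G = list(range(1, d + 1))
print(h0(G), log2(sum(G) / d) + 1)   # log2(d) versus log2(d + 1)
```

The second print illustrates how tight the bound is: for the alphabet $\{1, \ldots, d\}$ with uniform frequencies, the entropy is $\log d$ against an upper bound of $\log(d + 1)$.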

Interestingly, the two measures $gap$ and $nH_0(G)$ are upper-bounded by
the same quantity $n\log(u/n) + n$. This is not a trivial result since, unlike
$nH_0(G)$, $gap$ does not include the information needed to unambiguously
reconstruct codeword boundaries (even though $nH_0(G)$, in turn, does not
include the information, e.g. a codebook, needed to decode codewords). Using
the same arguments as in Lemma 1 and Corollary 1 we can moreover derive the
bounds:

Corollary 2. $nH_0(G) \le B(n,u) \le uH_0(S)$ if $n < u/2$

The pair $(U, S)$ can be represented as a length-$u$ bitvector $B$ with $n$ bits
set. Let $H_0 = H_0(S)$ be the zero-order entropy of $B$ and $d$ be the number of
distinct distances between bits set in $B$. Then:

Corollary 3. There exists a zero-order compressed representation of $B$ taking
$n(H_0(G) + 1) + O(d\log u) \le uH_0 + n + O(d\log u)$ bits of space.

Proof. Can be easily obtained by compressing the gap sequence with Huffman
encoding and by applying Corollary 2. □

Note that the number $d$ of distinct distances between bits set in $B$ is
trivially upper-bounded by $n$ and $O(\sqrt{u})$ (see footnote 1).

1 Assume, by contradiction, that $d \in \omega(\sqrt{u})$. Then, the set
$\Sigma_{gap}$ of gaps that minimizes $\sum_{s \in \Sigma_{gap}} s$ is
$\Sigma_{gap} = \{1, \ldots, d\}$, for which we obtain
$\sum_{s \in \Sigma_{gap}} s = \Theta(d^2) = \omega(u)$. This is absurd, since the
sum of all gaps cannot exceed $u$.
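
The Huffman-based representation behind Corollary 3 can be sketched as follows. The code computes Huffman codeword lengths for the distinct gap lengths and checks that the encoded stream stays within $n(H_0(G) + 1)$ bits; the codebook of $O(d\log u)$ bits is not modelled. The function name and the example stream are hypothetical, and the single-symbol case is handled by assigning a 1-bit code, which is an assumption of this sketch.

```python
import heapq
from collections import Counter
from math import log2

def huffman_lengths(freq):
    """Codeword length of each symbol in a Huffman code for the counts in freq."""
    if len(freq) == 1:
        return {s: 1 for s in freq}          # a lone symbol still costs one bit
    heap = [(c, i, [s]) for i, (s, c) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tie = len(heap)                          # unique tie-breaker for merged nodes
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                    # every merge adds one bit to its leaves
            lengths[s] += 1
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths

u, G = 16, [2, 2, 1, 5, 6]                   # hypothetical gap stream summing to u
n, freq = len(G), Counter(G)
d = len(freq)
lengths = huffman_lengths(freq)
payload = sum(freq[g] * lengths[g] for g in freq)
nH0 = sum(c * log2(n / c) for c in freq.values())
print(nH0 <= payload <= nH0 + n)             # Huffman stays within n(H_0(G) + 1) bits
print(d * (d + 1) // 2 <= u)                 # d distinct gaps sum to at least d(d+1)/2
```

The last line is the counting argument of footnote 1 in concrete form: $d$ distinct positive gaps sum to at least $d(d+1)/2$, so $d$ cannot grow faster than $\sqrt{2u}$.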
3 A Compressed-Gap FID


Let us now turn our attention to fully-indexable dictionary data structures.
Our aim is to obtain a structure that takes asymptotically the same space as
the representation described in Corollary 3.

Our strategy is the following: we use Elias $\delta$-encoding and exploit its
property of being an asymptotically optimal universal code [6] to encode the
gap stream in $(1 + o(1))nH_0(G) + n$ bits. We then build a two-level structure
on top of this representation to support rank and select queries. We adopt an
approach similar to [5] and first describe a binary-searchable dictionary
(BSD) that supports all queries in $O(\log u)$ time. The BSD is finally used
as a building block for our final structure, which improves all query times to
$O(\log(u/n) + \log\log u)$ within the same space.

Let $\Sigma_{gap}$ and $f : \Sigma_{gap} \to \mathbb{R}^+$ be the set of all gap
lengths and the empirical frequencies associated with the gap stream,
respectively, and consider an (arbitrary) ordering of the symbols
$ord : \Sigma_{gap} \to \{1, \ldots, d\}$, $d = |\Sigma_{gap}|$ (i.e. a bijection),
such that if $ord(g_i) < ord(g_j)$ then $f(g_i) \ge f(g_j)$ for all
$g_i, g_j \in \Sigma_{gap}$, so that more frequent gap lengths receive smaller
ranks and hence shorter codes. Let $\delta(x)$, $x > 0$, be the Elias $\delta$ code
of the integer $x$. Then, we associate the code $code(g_i) = \delta(ord(g_i))$ to
each gap length $g_i \in \Sigma_{gap}$. Since $\delta$ is an asymptotically optimal
universal code [6], the bit length $l$ of the compressed stream
$code(g_1) \ldots code(g_n)$ is at most $(1 + o(1))nH_0(G) + n$ bits (see
footnote 2). In the following we assume to work under the word RAM model with
word size $\Theta(\log u)$ bits, so that we can extract any $O(\log u)$-bit block
from a plain bitvector in constant time. We store the bit representations of the
compressed gaps sequentially in a bitvector
$C[0, \ldots, l-1] = code(g_1) \ldots code(g_n)$. An additional array
$D[1, \ldots, d]$ defined as $D[i] = ord^{-1}(i)$ (the codebook) is moreover built
to permit the decoding of codewords. Note that, given the starting position of
$code(g_i)$, $0 < i \le n$, in the bitvector $C$, we can extract and decode
$code(g_i) = \delta(ord(g_i))$ in $O(1)$ time: first, we need to decode the
$\gamma$-prefix of $\delta(ord(g_i))$. This can be done in $O(1)$ time using two
universal tables of $O(2^{\log\log u}\log\log u) = O(\log u\log\log u)$ bits each
(one for the unary prefix and the other for the rest of the $\gamma$-prefix of the
code). This gives us (i) the bit-length of the $\gamma$-prefix of
$\delta(ord(g_i))$, and (ii) the bit-length of $ord(g_i)$ (without the most
significant bit). We can then extract the bits of $ord(g_i)$ and access
$D[ord(g_i)] = g_i$ in constant time. To improve readability, in the next
sections we will implicitly make use of this strategy and, provided that we know
the starting position of $code(g_j)$ in $C$, say "read gap $g_j$" instead of
"extract and decode $code(g_j)$".
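
The encoding just described can be sketched in a few lines. The example below builds the codebook $D$ by decreasing frequency, writes $C$ as a string of bits, and reads gaps back from a known starting position. It is a plain simulation under stated assumptions: the prefix is decoded with a loop rather than with the constant-time universal tables, ties between equally frequent gaps are broken by value, and all function names and the example stream are hypothetical.

```python
from collections import Counter

def elias_gamma(x):
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + "1" + b[1:]

def elias_delta(x):
    b = bin(x)[2:]
    return elias_gamma(len(b)) + b[1:]

def build_codebook(G):
    """D[i] = ord^{-1}(i): gap lengths by decreasing frequency (D[0] is unused)."""
    freq = Counter(G)
    return [None] + sorted(freq, key=lambda g: (-freq[g], g))

def encode(G):
    D = build_codebook(G)
    ord_ = {g: i for i, g in enumerate(D) if i > 0}
    return "".join(elias_delta(ord_[g]) for g in G), D

def read_gap(C, pos, D):
    """Decode the codeword starting at bit pos of C; return (gap, next position)."""
    zeros = 0
    while C[pos + zeros] == "0":             # unary part of the gamma-prefix
        zeros += 1
    L = int(C[pos + zeros : pos + 2 * zeros + 1], 2)   # bit length of ord(g)
    pos += 2 * zeros + 1
    o = int("1" + C[pos : pos + L - 1], 2)   # restore the dropped leading 1
    return D[o], pos + L - 1

G = [2, 2, 1, 5, 6]                          # hypothetical gap stream
C, D = encode(G)
pos, decoded = 0, []
while pos < len(C):
    g, pos = read_gap(C, pos, D)
    decoded.append(g)
print(decoded == G, len(C))                  # True 15
```

When all gaps are equal, every codeword is $\delta(1)$, a single bit, so the stream takes exactly $n$ bits; this is the additive $n$ term in the space bound.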

3.1 A Binary-Searchable Dictionary 

We divide the elements of $S = \{s_1, \ldots, s_n\}$ into blocks of size
$t = \lceil \log u \rceil$ (we assume for clarity of exposition that $t$ divides
$n$; the following arguments can be easily adapted to the general case). For each
block $\{s_{it+1}, \ldots, s_{(i+1)t}\}$, $i = 0, \ldots, n/t - 1$, we store
explicitly the smallest element $s_{it+1}$ and a pointer to the beginning of
$code(g_{it+2})$ in the bitvector $C$ (see footnote 3). These structures are
sufficient to obtain our BSD. $select(S, i)$, $0 \le i < n$, is implemented by
accessing the $\lfloor i/t \rfloor$-th block and reading $i \bmod t < t$ gaps in
$C$ starting from $g_{\lfloor i/t \rfloor t + 2}$. Then,

$$select(S, i) = s_{\lfloor i/t \rfloor t + 1} + \sum_{j = \lfloor i/t \rfloor t + 2}^{i+1} g_j$$

2 Even when $H_0(G) = 0$, with $\delta$-encoding we spend at least 1 bit per symbol,
hence the additional $n$ term. The $o(nH_0(G))$ term comes from the overhead
introduced by $\delta$-encoding, and in the worst case ($n$ distinct gaps,
$nH_0(G) \in \Theta(n\log n)$) equals $\Theta(n\log\log n)$ bits.

$rank(S, x)$, $x \in U = \{0, \ldots, u-1\}$, is implemented by binary-searching
the blocks according to the explicitly stored elements $s_{it+1}$,
$i = 0, \ldots, n/t - 1$, and then by extracting gaps in the block of interest
until we reach element $x$. More formally, let $0 \le i \le n/t - 1$ be the
biggest integer (if any) such that $s_{it+1} \le x$. $i$ can be found by binary
search in $O(\log n)$ time. If such an integer does not exist, then
$rank(S, x) = 0$. Otherwise, let $1 \le j \le t$ be the smallest integer such that
$q = s_{it+1} + \sum_{h=1}^{j} g_{it+1+h} > x$. $j$ can be found by linear
search in $O(t) = O(\log u)$ time. Then,

$$rank(S, x) = it + j$$

(if no such $j$ exists, i.e. $x \ge s_n$, then $rank(S, x) = n$).
The bit-length of $C$ is at most $O(n\log u)$, so a pointer to $C$ takes
$\log n + \log\log u + O(1) \le \log u + \log\log u + O(1)$ bits. It follows that
for each block we spend $2\log u + \log\log u + O(1)$ bits (one element
$s_{it+1}$ and a pointer to $C$), so the blocks take overall
$(2\log u + \log\log u + O(1)) \cdot n/\log u = 2n + o(n)$ bits. We obtain:

Lemma 3. Let $d$ be the number of distinct gap lengths between elements in
$S$. The binary-searchable dictionary described in Section 3.1 occupies
$(1 + o(1))nH_0(G) + (3 + o(1))n + O((d + \log\log u)\log u)$ bits of space and
supports rank and select queries in $O(\log u)$ time.

Note that the size of the proposed BSD can be exponentially smaller than
$u$ if $S$ is sparse. In the next section we show how to obtain
$O(\log(u/n) + \log\log u)$-time queries without asymptotically increasing space
usage.
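
A compact simulation of the BSD may help to fix ideas. In the sketch below, block heads and pointers into $C$ are kept in Python lists, the first gap of each block is not stored (as noted in footnote 3), and select and rank follow the scheme described above. The class and helper names are hypothetical, $C$ is a string of bits rather than a packed bitvector, and codewords are decoded with a loop instead of universal tables.

```python
from bisect import bisect_right
from collections import Counter

def delta(x):
    """Elias delta code of x >= 1 as a string of bits."""
    b = bin(x)[2:]
    lb = bin(len(b))[2:]
    return "0" * (len(lb) - 1) + lb + b[1:]

def read(C, pos):
    """Decode one delta codeword at bit pos; return (value, next position)."""
    z = 0
    while C[pos + z] == "0":
        z += 1
    L = int(C[pos + z : pos + 2 * z + 1], 2)
    end = pos + 2 * z + L
    return int("1" + C[pos + 2 * z + 1 : end], 2), end

class BSD:
    def __init__(self, S, t):
        self.n, self.t = len(S), t
        # gap before each element; None marks block heads, whose gap is not stored
        gaps = [S[j] - S[j - 1] if j % t else None for j in range(self.n)]
        freq = Counter(g for g in gaps if g is not None)
        self.D = [None] + sorted(freq, key=lambda g: (-freq[g], g))   # codebook
        ord_ = {g: r for r, g in enumerate(self.D) if r}
        self.heads, self.ptr, words, pos = [], [], [], 0
        for j, g in enumerate(gaps):
            if g is None:
                self.heads.append(S[j])      # smallest element of the block
                self.ptr.append(pos)         # where the block's second gap starts
            else:
                words.append(delta(ord_[g]))
                pos += len(words[-1])
        self.C = "".join(words)

    def select(self, i):
        """Element of rank i, counting from 0."""
        b = i // self.t
        x, pos = self.heads[b], self.ptr[b]
        for _ in range(i % self.t):
            r, pos = read(self.C, pos)
            x += self.D[r]
        return x

    def rank(self, x):
        """Number of elements <= x."""
        b = bisect_right(self.heads, x) - 1  # last block whose head is <= x
        if b < 0:
            return 0
        count, y, pos = b * self.t + 1, self.heads[b], self.ptr[b]
        while count < min(self.n, (b + 1) * self.t):
            r, pos = read(self.C, pos)
            y += self.D[r]
            if y > x:
                break
            count += 1
        return count

S = [1, 3, 4, 9, 15, 16, 20, 31]             # hypothetical sorted set, u = 32
bsd = BSD(S, t=3)
print([bsd.select(i) for i in range(len(S))] == S)
print([bsd.rank(x) for x in range(32)] == [bisect_right(S, x) for x in range(32)])
```

Unlike the description above, this sketch does not require $t$ to divide $n$: the scan in rank simply stops at the end of the last, possibly shorter, block.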

3.2 A Fully-Indexable Dictionary 

Let $v = \lceil u\log^2 u/n \rceil$. The idea is to divide $U$ into blocks of $v$
elements, and store a BSD for each block.

We build a constant-time rank and select succinct bitvector
$V[0, \ldots, \lceil u/v \rceil - 1]$ defined as $V[i] = 1$ if and only if
$S \cap \{iv, \ldots, (i+1)v - 1\} \ne \emptyset$. Additionally, one array
$R[0, \ldots, \lceil u/v \rceil - 1]$ stores sampled rank results: $R[0] = 0$ and
$R[i] = rank(S, iv - 1)$ for $i > 0$. We build a binary-searchable dictionary
$BSD(i)$ for each set $S_i = \{x - iv \mid x \in S \cap \{iv, \ldots, (i+1)v - 1\}\}$,
$i = 0, \ldots, \lceil u/v \rceil - 1$, where we use the same codebook $D$ for all
the BSD structures (i.e. $D$ is computed according to all gaps
$g_1, \ldots, g_n$). Note that there may exist a set $S_b$ (or more than one) such
that its first gap does not belong to $\{g_1, \ldots, g_n\}$. This happens each
time an element $s_i$ is the first of its block $b = \lfloor s_i/v \rfloor > 0$,
the gap $g_i$ overlaps blocks $b$ and $b - 1$, and
$s_i - b \cdot v + 1 \notin \{g_1, \ldots, g_n\}$. However, by construction of the
BSD data structure (see previous section), the first gap in $S_b$ is never used
(since we store the smallest element of $S_b$ explicitly), so this event does not
affect overall gap frequencies nor the space requirements of the array $D$.
Finally, one array $SEL[0, \ldots, \lceil n/t \rceil - 1]$, where
$t = \lceil \log^2 u \rceil$, stores the (number of the) block containing
$s_{it+1}$: $SEL[i] = \lfloor s_{it+1}/v \rfloor$, for
$i = 0, \ldots, \lceil n/t \rceil - 1$.

3 We point to $code(g_{it+2})$ instead of $code(g_{it+1})$ because $s_{it+1}$ is
explicitly stored. As a matter of fact, we can avoid storing $code(g_{it+1})$ in
$C$.

Using the above described structures, we can now show how to efficiently
solve queries. $rank(S, x)$, $x \in U = \{0, \ldots, u-1\}$, is implemented by
accessing the $\lfloor x/v \rfloor$-th block and calling rank on
$BSD(\lfloor x/v \rfloor)$. More formally,

$$rank(S, x) = R[\lfloor x/v \rfloor] + rank(S_{\lfloor x/v \rfloor}, x \bmod v)$$

where $rank(S_{\lfloor x/v \rfloor}, x \bmod v)$ is called on the structure
$BSD(\lfloor x/v \rfloor)$. Rank is thus solved in
$O(\log v) = O(\log(u/n) + \log\log u)$ time. To solve $select(S, i)$, we first
find by binary search the block containing $s_{i+1}$, and then call select on the
corresponding BSD. In more detail, let $q_l = SEL[\lfloor i/t \rfloor]$ and
$q_r = SEL[\lfloor i/t \rfloor + 1]$ if $\lfloor i/t \rfloor + 1 < \lceil n/t \rceil$,
$q_r = \lceil u/v \rceil - 1$ (the last block) otherwise. By construction of
$SEL$, the block containing element $s_{i+1}$ is one of
$q_l, q_l + 1, \ldots, q_r$. Note that the number $q_r - q_l + 1$ of blocks of
interest can be arbitrarily large since there may be an arbitrary number of empty
blocks among them. However, at most $t$ of them will contain at least one element
(by construction of $SEL$). Then, we can perform binary search only on the blocks
marked with a 1 in the bitvector $V$: during binary search we access blocks at
positions of the form $select(V, j)$ (note: this is a constant-time select
performed on the bitvector $V$), starting with the range
$j \in [rank(V, q_l) - 1, rank(V, q_r) - 1]$. Binary search is performed
according to partial ranks (array $R$). Let $q_l \le q_m \le q_r$ be the biggest
integer such that $R[q_m] \le i < R[q_m + 1]$ (if $q_m + 1 \ge \lceil u/v \rceil$
then simply ignore the upper bound in the previous inequality). According to the
above considerations, $q_m$ can be found in $O(\log t) = O(\log\log u)$ time using
binary search. We can solve $select(S, i)$ as follows:

$$select(S, i) = q_m \cdot v + select(S_{q_m}, i - R[q_m])$$

where $select(S_{q_m}, i - R[q_m])$ is called on the structure $BSD(q_m)$. select
is thus solved on our FID in
$O(\log v) + O(\log\log u) = O(\log(u/n) + \log\log u)$ time.
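
The two-level decomposition can be sketched as follows. This is a simplified model under explicit assumptions: each $BSD(b)$ is replaced by a plain sorted list of local values, the bitvector $V$ is represented by the list of its 1-positions, and the $SEL$ array is omitted, so that the binary search in select spans all non-empty blocks instead of at most $t$ of them. The class name, the example set, and the hand-picked block size $v$ are hypothetical.

```python
from bisect import bisect_right

class TwoLevelFID:
    def __init__(self, S, u, v):
        nblocks = -(-u // v)                              # ceil(u / v)
        self.v = v
        # local[b] plays the role of BSD(b): the set S_b = {x - b*v} as a sorted list
        self.local = [[] for _ in range(nblocks)]
        for x in S:                                       # S is assumed sorted
            self.local[x // v].append(x % v)
        # R[b] = rank(S, b*v - 1): number of elements in the blocks before b
        self.R = [0] * nblocks
        for b in range(1, nblocks):
            self.R[b] = self.R[b - 1] + len(self.local[b - 1])
        # positions of the 1s of V, standing in for constant-time select(V, j)
        self.ones = [b for b in range(nblocks) if self.local[b]]
        self.R_ones = [self.R[b] for b in self.ones]

    def rank(self, x):
        b = x // self.v
        return self.R[b] + bisect_right(self.local[b], x % self.v)

    def select(self, i):
        # binary search, over non-empty blocks only, for the last q with R[q] <= i
        q = self.ones[bisect_right(self.R_ones, i) - 1]
        return q * self.v + self.local[q][i - self.R[q]]

u, v = 64, 8                                  # hypothetical sizes, v chosen by hand
S = [1, 3, 4, 9, 40, 41, 63]
fid = TwoLevelFID(S, u, v)
print([fid.rank(x) for x in (0, 4, 39, 63)])  # [0, 3, 4, 7]
print([fid.select(i) for i in range(len(S))] == S)
```

Restricting the search to non-empty blocks is what keeps select independent of the number of empty blocks between two consecutive elements.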

Bitvector $V$ takes $(1 + o(1))u/v = (1 + o(1))n/\log^2 u = o(n)$ bits. Arrays
$R$ and $SEL$ take $\log u \cdot u/v = n/\log u = o(n)$ and
$\log u \cdot n/t = n/\log u = o(n)$ bits of space, respectively. Finally, all
BSD data structures take overall $(1 + o(1))nH_0(G) + (3 + o(1))n$ bits, and the
codebook $D$ and the universal tables take $O((d + \log\log u)\log u)$ bits. We
can state our final result:



Theorem 2. Let $d$ be the number of distinct gap lengths between elements in
$S$. The FID described in Section 3.2 takes
$(1 + o(1))nH_0(G) + (3 + o(1))n + O((d + \log\log u)\log u)$ bits of space and
supports rank and select queries in $O(\log(u/n) + \log\log u)$ time.

The result stated in Theorem 2 improves the space of [11,17], reducing
both the leading and the $o(u)$ terms from $gap + O(n\log\log(u/n))$ and
$u\log\log u/\log u$ bits to $(1 + o(1))nH_0(G) + (3 + o(1))n$ and
$O((d + \log\log u)\log u) \subseteq O(\sqrt{u}\log u)$ bits, respectively. This
improvement comes at the price of an $O(\log(u/n) + \log\log u)$ slowdown in all
query times. Notice that we cannot apply the general technique proposed by
Mäkinen and Navarro in [11] in order to obtain $O(1)$ query times, since $code()$
does not (always) satisfy $|code(x)| \in O(\log x)$ (this is one of the properties
characterizing random access self-delimiting codes [11]). An interesting line of
research would be to envision a broader class of codes (including $code()$) for
which we can describe a general technique guaranteeing constant-time queries.


4 $H_0(G)$ in practice

In order to assess in practice the differences between the measures discussed
above, we adopted the approach of [8] and simulated several sets, computing for
each of them the number of bits per item required by $gap$, $gap + Z_\delta$,
$uH_0(S)$, $nH_0(G)$, $nH_0(G) + Z_\delta$, and $nH_0(G) + Z_\delta + CB$, where the
last two measures refer to $nH_0(G)$ plus the overhead introduced by
$\delta$-encoding (i.e. encoding $g_1, \ldots, g_n$ as described in the previous
section) and by the codebook size ($CB$).

Gaps were generated according to uniform (Table 1) and binomial (Table 2)
distributions. Table 1 reports the same experiment performed in [8]
(except that we use $\delta$ instead of $\gamma$ and we do not consider
RLE), updated with our measure $nH_0(G)$. As expected, in this case $nH_0(G)$
performs slightly worse than $gap$ when taking into account all encoding
overheads (columns 3 and 7). This can be explained by the fact that gaps are
uniform, thus making $gap + Z_\delta$ and $nH_0(G) + Z_\delta$ (without the
codebook) almost equivalent. An interesting fact, in accordance with Theorem 1,
is that, even though this is its worst case, $nH_0(G)$ is always smaller (by
about 0.5 bits per item) than $uH_0(S)$.

The advantages of using $nH_0(G)$ become evident when non-uniform
distributions are used. Table 2 reports the results on binomially-distributed
gaps (see footnote 4). As expected, in this case our measure considerably
improves on $gap$: if the two techniques are compared while taking into account
all encoding overheads (columns 3 and 7), our strategy requires about 58% of the
space of gap encoding.

4 We chose a binomial distribution in order to model a scenario in which gap
lengths are concentrated around a value $g \gg 0$ (in this case, $g$ is the mean).
Intuitively, in this case $gap$ does not perform well because small numbers are
not frequent.
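
The kind of simulation described in this section can be sketched as follows. The code draws uniform and binomial gap streams and reports bits per item for $gap$, $gap + Z_\delta$, $uH_0(S)$, and $nH_0(G)$; the codebook and the $\delta(ord(\cdot))$ stream length are left out. The seed, the value of $n$, the choice of max_gap, and the use of $1 + \mathrm{Binomial}(max\_gap - 1, 1/2)$ as the shifted binomial are assumptions of this sketch and not necessarily the exact setup used for the tables.

```python
import random
from collections import Counter
from math import log2

def bits_per_item(G):
    """Per-item cost of a gap stream under the measures compared in this section."""
    n, u = len(G), sum(G)                       # u - 1 is the largest element of S
    lens = [g.bit_length() for g in G]
    gap = sum(lens)
    z_delta = sum(2 * (b.bit_length() - 1) for b in lens)
    p = n / u
    uH0_S = u * (p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))) if n < u else 0.0
    nH0_G = sum(c * log2(n / c) for c in Counter(G).values())
    return {"gap": gap / n, "gap+Z_delta": (gap + z_delta) / n,
            "uH0(S)": uH0_S / n, "nH0(G)": nH0_G / n}

rng = random.Random(0)                          # fixed seed, arbitrary choice
n, max_gap = 10_000, 2 ** 4 + 1                 # assumed parameters for the sketch
uniform = [rng.randint(1, max_gap) for _ in range(n)]
# 1 + Binomial(max_gap - 1, 1/2): gaps concentrated around the middle of [1, max_gap]
binomial = [1 + sum(rng.random() < 0.5 for _ in range(max_gap - 1)) for _ in range(n)]
for name, G in (("uniform", uniform), ("binomial", binomial)):
    print(name, {k: round(val, 3) for k, val in bits_per_item(G).items()})
```

With gaps concentrated around their mean, most gaps need the same number of bits under $gap$, while their empirical entropy is much lower; this is the effect that the binomial experiment is meant to expose.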
log(max_gap)   gap       gap + Z_δ   uH_0(S)   nH_0(G)   nH_0(G) + Z_δ   nH_0(G) + Z_δ + CB
1              1.66717   3.00151     2.00103   1.58496   2.99842         2.99848
2              2.20164   3.80142     2.75854   2.32191   3.79349         3.79364
3              2.77733   5.00151     3.61667   3.16987   4.98418         4.98454
4              3.47452   6.53906     4.5389    4.08735   6.50696         6.50781
5              4.2771    7.79638     5.50097   5.04417   7.75575         7.75773
6              5.15079   8.90439     6.48606   6.02187   8.8685          8.87305
7              6.09095   10.0028     7.4809    7.01044   9.94679         9.95711
8              7.04186   11.9893     8.48908   8.00377   11.889          11.9122
9              8.02066   13.4915     9.50168   8.99923   13.3703         13.4216
10             9.01571   14.7531     10.5266   9.99358   14.5752         14.6879
11             10.0076   15.8755     11.5554   10.9857   15.661          15.9068
12             11.0103   16.9465     12.599    11.9707   16.6565         17.1892
13             12.0031   17.9701     13.6584   12.94     17.5894         18.7364
14             13.0009   18.9844     14.7359   13.8789   18.4625         20.9157
15             13.996    19.9873     15.839    14.7427   19.2538         24.2575

Table 1. Comparison between $gap$, $gap + Z_\delta$, $uH_0(S)$, $nH_0(G)$,
$nH_0(G) + Z_\delta$ (i.e. accounting for the $\delta$ overhead per symbol), and
$nH_0(G) + Z_\delta + CB$ (i.e. accounting for the $\delta$ and codebook $CB$
overhead per symbol) on randomly-generated sets. Gaps between the $n$ items
($n$ affects only the variance of the results; we used $n = 10^5$) are uniformly
distributed in the interval [1, max_gap]. All columns except the first report the
number of bits per item required by each method.


log(max_gap)   gap       gap + Z_δ   uH_0(S)   nH_0(G)   nH_0(G) + Z_δ   nH_0(G) + Z_δ + CB
1              1.74989   3.24967     2.22939   1.50052   2.50156         2.50162
2              2.25085   4.12525     3.16331   2.03377   3.00555         3.0057
3              2.88491   4.94587     4.18044   2.5445    3.49472         3.49508
4              3.77183   7.31887     5.22493   3.04741   4.094           4.09485
5              4.70015   8.69979     6.27376   3.54494   4.82176         4.82326
6              5.64788   9.64788     7.31532   4.04711   5.61441         5.61679
7              6.60309   10.6031     8.3491    4.54742   6.3782          6.3822
8              7.57464   12.7239     9.37466   5.04947   7.08812         7.09424
9              8.55226   14.5523     10.3937   5.54834   7.75208         7.76178
10             9.53716   15.5372     11.4078   6.04518   8.33989         8.35386
11             10.5229   16.5229     12.4178   6.54489   8.93035         8.95219
12             11.516    17.516      13.425    7.04343   9.56187         9.59411
13             12.5135   18.5134     14.4301   7.54485   10.3296         10.3775
14             13.5082   19.5082     15.4338   8.03851   11.1441         11.2149
15             14.5084   20.5084     16.4364   8.53758   11.9996         12.1044

Table 2. Comparison between $gap$, $gap + Z_\delta$, $uH_0(S)$, $nH_0(G)$,
$nH_0(G) + Z_\delta$, and $nH_0(G) + Z_\delta + CB$ on randomly-generated sets.
Gaps between the $n$ items ($n = 10^5$) are binomially distributed in the
(shifted) interval [1, max_gap] with success probability $p = 1/2$. All columns
except the first report the number of bits per item required by each method.
5 Conclusions


In this paper we introduced $H_0(G)$, a new data-aware measure based on the
idea of compressing the gaps between elements of a set
$S \subseteq \{0, \ldots, u-1\}$. We provided new theoretical upper bounds for
this measure, and showed that in practice, if the gap stream is compressible,
$H_0(G)$ considerably improves on the space usage of gap encoding techniques
combined with logarithmic codes such as Elias $\delta$-encoding. Finally, we
proposed a new zero-order representation of bitvectors based on our new measure
and a compressed-gap fully-indexable dictionary supporting fast queries and
taking little space in addition to $nH_0(G)$.

As expected, simulations confirmed that the proposed compressed-gap
measure is particularly convenient in situations where the gaps follow a
non-uniform distribution or are dominated mainly by large numbers. The
main drawback of $nH_0(G)$ seems to be the overhead introduced by the
zero-order compressor, which in our solution is $O(\sqrt{u}\log u)$ bits in the
worst case. However, in some practical applications this overhead, being
proportional to the number $d$ of distinct gap lengths, is expected to be
negligible with respect to the overall structure size. One example of such an
application is run-length compression of the BWT of highly repetitive text
collections (e.g. genome variants), where run lengths are expected to scale
linearly with the number of documents in the collection [?].

We plan to implement our FID and test it against state-of-the-art practical
gap-encoded bitvector representations (e.g. sd_vector of SDSL [7]). Notice
that in practice Huffman compression of the gaps should be preferred to
universal $\delta$-encoding, as the additional overhead is much smaller (i.e. we
can remove the $o(nH_0(G))$ term). Our FID could find a first application in
repetition-aware self-indexing, e.g. by using it as a building block of a more
space-efficient run-length compressed suffix array (RLCSA [?]).